Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[controller] Harden update-store workflow #1091

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nisargthakkar
Copy link
Contributor

@nisargthakkar nisargthakkar commented Jul 31, 2024

Harden update-store workflow

This PR unifies the UpdateStore logic between child controller and parent controllers. We've faced many issues due to the structure of the code (especially VeniceParentHelixAdmin) where a set of store-update operations puts it into a non-healthy state. These are all the changes in this PR:

  1. Add a new class UpdateStoreUtils that is called by both parent and child controllers to apply the store update and perform the necessary validations.
  2. Perform store config validation after applying all configs. This helps check for config compatibility. The following validations are performed currently.
    1. Batch-only:
      1. Inc push is not allowed
      2. WC is not allowed
      3. A/A is not allowed
    2. Hybrid:
      1. Rewind time in seconds cannot be negative
      2. Both offset lag threshold and producer time lag threshold cannot be negative
      3. In multi-region mode, inc push is not allowed for NON_AGGREGATE data replication policy
      4. ACTIVE_ACTIVE data replication policy is only supported when Active-Active replication is enabled
      5. Partition count must be greater than 0
    3. In single-region mode, incremental push allowed when the store is hybrid
    4. In multi-region mode, incremental push is allowed when the store is hybrid and either of the following are true:
      1. Store has active-active replication enabled
      2. Store doesn't have active-active replication, but DataReplicationPolicy is AGGREGATE
    5. Storage quota cannot be less than 0
    6. Read quota cannot be less than 0
    7. Read quota cannot be larger than "router count * max_per_router_quota"
    8. Store cannot be active-active replication enabled if it is not also native-replication enabled
    9. Active-Active is not supported when amplification factor is more than 1
    10. Write-compute is not supported when amplification factor is more than 1
    11. Store must have a partitioner-config set
    12. Validate if the partitioner can be built with the specified partitioner configs
    13. Latest superset value schema id must map to a valid value schema id
    14. Max compaction lag < min compaction lag
    15. ETL Proxy user must be set. (We might want to deprecate this and remove the corresponding tests, but we have plans to get KIF-based ETL that could benefit from the ETLConfig in store configs)
  3. Added a new concept of "primary controller". This is the controller that is allowed to perform inferred updates. In a multi-region mode, this is the parent controller. In a single-region mode, this is a child controller.
  4. Previously each "feature" update incurred an update to the store metadata on Zk - and the corresponding watchers would get invoked. This change replaces all those updates with a single Zk update, which is done after the resulting configs are validated.
  5. The ordinal from BackupStrategy enum was being used to write to the admin channel. This is problematic when the enums evolve. Added a getValue function and used that instead.
  6. A new config: controller.external.superset.schema.generation.enabled has been added to replace controller.parent.external.superset.schema.generation.enabled because external superset schema generation must be allowed in single-region mode also. controller.parent.external.superset.schema.generation.enabled has been marked deprecated, but it has not been completely removed yet for backward compatibility reasons.
  7. Due to the changes to inferred configs, a large chunk of tests started failing as they had incorrect setups. Fixed most of such test setups.

Some side-effects of this change are:

  1. Controller now checks if the store schema can support WC at the time of enabling the WC config. There were a few tests that also used unsupported value schemas for WC.
  2. Persona validation was only present in parent controller in multi-region mode. This now works in single-region mode as well
  3. SupersetSchemaGeneratorWithCustomProp had a bug where if the first schema has a custom prop, the future superset schema generation would fail as Avro doesn't allow overriding custom props. This got caught as the update store logic now also tries creating superset schema if a store enabled RC or WC, or if it previously had a superset schema.

There are a few other changes that we should do, but are not done in this PR:
All major operations should only be allowed on the parent controller - Create a store, delete a store, add schemas. We should exclude some system stores from this check like we have for the check allowing batch push to admin in child

Recommendation for Reviewers

I recommend going through the changes in at least two passes. In the first pass, look through all the stuff that has been purely deleted. (Skip ParentControllerConfigUpdateUtils as that has been renamed to PrimaryControllerConfigUpdateUtils and it's contents partially moved to UpdateStoreUtils. So, GitHub doesn't detect is as a renaming). While doing this pass, you can also glance through various small changes that do not need much pondering over.

In the second pass, follow this review order:

  1. VeniceControllerClusterConfig, VeniceControllerConfig, VeniceControllerMultiClusterConfig
  2. VeniceControllerService
  3. Admin
  4. UpdateStoreUtils
  5. PrimaryControllerConfigUpdateUtils
  6. AdminUtils
  7. UpdateStoreWrapper
  8. VeniceHelixAdmin
  9. VeniceParentHelixAdmin
  10. SupersetSchemaGeneratorWithCustomProp
  11. All the test modifications
  12. All newly added tests

How was this PR tested?

Added new tests. Modified existing tests. GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 4 times, most recently from 9edf55a to a019bc8 Compare August 1, 2024 17:15
@nisargthakkar nisargthakkar changed the title [WIP] Harden update-store workflow [controller] Harden update-store workflow Aug 1, 2024
@nisargthakkar nisargthakkar marked this pull request as ready for review August 1, 2024 19:27
Copy link
Contributor

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks like a good change! I didn't have time to look at it fully yet, but I left one minor comment so far.

Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a pass. Would recommend having another pair of eyes especially on the following files due to my limited context in the code base

  • VeniceParentHelixAdmin
  • VeniceHelixAdmin
  • UpdateStoreUtils

@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 3 times, most recently from a296585 to 33d1620 Compare September 19, 2024 21:11
@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk2 branch 6 times, most recently from e4ebdbe to 408d463 Compare October 1, 2024 02:37
Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I mostly made sure that the checks and description in the summary matches w/ the code changes.

Final comments on the strategy about releasing the changes

  1. Do we have branch based releases for complex changes/features?
  2. It feels merging this w/ master and finding issues during certification can be potentially difficult to deal/mitigate as surface area of the changes is large and reverting such large commit accurately (assuming other changes pile in) might be tricky.

How do plan to certify this w/ our current certification flow?

@nisargthakkar
Copy link
Contributor Author

Final comments on the strategy about releasing the changes

  1. Do we have branch based releases for complex changes/features?

No, we don't. I think since the integration tests didn't have to change apart from fixing invalid configs, I can fix all stores to make sure that they have valid configs before we merge this PR. There is a concern that this impacts more than just LI, but I think no one else is using most of the advanced features that have added validations. Also, this won't impact any reads/writes, but control paths would get impacted.

  1. It feels merging this w/ master and finding issues during certification can be potentially difficult to deal/mitigate as surface area of the changes is large and reverting such large commit accurately (assuming other changes pile in) might be tricky.

That is correct. With such a large change that modifies how update store is basically done, this is a risk I think is worth taking. There are lots of behavioral changes that make it impractical to be made in a way we can flip it on and off through feature flags. I don't think we have added any change that blocks a revert - so we have that option open.

How do plan to certify this w/ our current certification flow?

This will get merged in our code and will only get deployed in the certification cluster. It will not be exposed to user clusters and user stores until we certify the change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants